ABSTRACT

In this report we describe the automatic identification of speakers from their
voices. This process has applications in forensics and in voice-actuated security
systems. The implementation and theoretical underpinnings of a statistical
speaker recognition system are presented, together with the performance of the
system on standard speech corpora. In a speaker verification experiment, the
system yielded an equal error rate of under 5% when identical microphones were
used for testing and training.
Automatic Speaker Recognition Using Statistical Models


EXECUTIVE  SUMMARY 

The performance and robustness of current speaker recognition technology are such that
this technology may be put into operation for defence and civilian applications. Applications
include voice-actuated security systems, such as door locks, computer access controls,
and telephone banking systems. In these applications, voice access may be more convenient
than traditional methods that require users to remember code words or identification
numbers. Another application of speaker recognition is its use for forensic purposes.

The most popular speaker recognition techniques are those based upon statistical
modelling. These techniques are reliable, well known, easy to implement, and have substantial
commercial and academic support. The model-based approach has led to the development
of commercial off-the-shelf dictation systems, e.g. Dragon Dictate. However, the source
code for these systems is generally not available. Thus their modification to address
problems of defence interest, apart from dictation and command recognition under favorable
conditions, is frequently not possible.

This report describes the design and implementation of a speaker identification system
suitable for use in real-world applications such as those mentioned above. The system
has been designed using a combination of standard signal processing techniques found in
the literature and other techniques developed here. The performance of the system has
been measured on standard evaluation corpora. When compared to a similar signal
processing system from the literature, the system described here had superior performance
and reduced computational complexity.
1 Introduction


Automatic  speaker  recognition  refers  to  the  process  of  recognizing  speakers  from  their 
voices.  This  process  may  be  performed  by  comparing  the  utterance  from  a  speaker  of 
unknown  identity  with  templates  or  models  of  various  speakers  of  interest.  The  degree  of 
similarity  between  the  models  and  the  utterance  is  then  used  to  make  a  decision. 

Examples of applications for automatic speaker recognition systems include
voice-actuated security systems such as door locks, computer access controls, or telephone banking
systems. In these applications voice access may be more convenient than traditional
methods which may require users to remember code words or a personal identification number
(PIN). Another application of speaker recognition is its use for forensic purposes [1].

The speaker recognition problem, referring to the general area of recognizing speakers
from their voices, may be subdivided into smaller problems. Speaker verification is the
problem of deciding whether or not an utterance is from a particular speaker. Speaker
identification is the problem of deciding who is speaking in a given utterance. Problems involving
only a set of known speakers are closed set problems, whereas open set problems may
involve speakers who have never been encountered previously and for whom no model exists.
When the actual sequence of spoken words is known the problem is text-dependent,
and when it is unknown the problem is text-independent.

The speaker recognition techniques considered in this report are those based on
prescribing statistical models for the speakers of interest and training these models with
training data. Other approaches are possible, e.g., techniques based on long-term statistics
[2, 3, 4], and neural network approaches [5, 6]. Long-term statistics are extreme
characterizations of the spectral characteristics and lack discriminating power [7]. Neural network
techniques have achieved similar performance to statistical model-based systems [5], but
their performance is affected strongly by network architecture and by the quality and
quantity of training data [5, 7, 8, 9]. In addition, neural training algorithms learn network node
“weights” which are difficult to relate to the physical phenomena being modelled [8].

The  most  successful  statistical  model  for  speaker  recognition  is  the  Gaussian  mixture 
model  (GMM)  which  is  related  to  the  hidden  Markov  model  (HMM)  used  extensively 
in  commercial  speech  recognition  systems.  These  statistical  models  are  generally  trained 
by  parameter  estimation  using  the  maximum  likelihood  (ML)  criterion.  Once  trained, 
classification  may  be  performed  using  the  maximum  a  posteriori  (MAP)  decision  rule. 
The  optimality  of  this  technique  is  discussed  in  [10,  11].  The  close  relationship  between 
GMMs  and  HMMs  will  allow  speaker  recognition  systems  developed  using  GMMs  to  take 
advantage  of  the  considerable  research  efforts  currently  being  undertaken  by  both  academia 
and  industry  in  HMM  based  speech  recognition. 

Generally, GMMs are used to model a feature vector obtained from the speech time
domain waveform. The most commonly used feature vector consists of the so-called cepstral
coefficients, whose use has dominated speech and speaker recognition over recent years.
This feature vector provides high data reduction, obtains high performance, and is
robust to certain channel effects. However, it is obtained by a non-linear transformation and
consequently is not robust to additive noise.

In this report we develop the theory and discuss the implementation and performance
of a speaker recognition system using Gaussian mixture modelling of cepstral coefficients.
The system was tested using the evaluation data of the 1996 National Institute of
Standards and Technology (NIST) Speaker Evaluation Workshop. The NIST workshop
addresses the text-independent, open set, speaker verification problem and comprises specific
testing and training conditions designed to exercise systems under real-world conditions.
Using the NIST evaluation, the system was compared to similar GMM-based systems
produced by Massachusetts Institute of Technology, Lincoln Laboratory (MIT LL). The
system developed here had superior performance compared to the equivalent MIT LL
system and had comparable performance to the MIT LL baseline system [12]. The system
developed here also had lower computational complexity in that it used significantly fewer
GMM states.

The remainder of the report is organized as follows. Section 2 presents the statistical
signal processing theory underpinning the speaker recognition system. Section 3 presents
the implementation details of the system, the speech corpora used, and the results
obtained. Section 4 presents the conclusions and possible avenues for further work.


2  Statistical  Speaker  Recognition  Theory 


Speaker recognition is a hypothesis testing problem. The optimal decision rule, in the
sense of minimizing the probability of error, is the maximum a posteriori (MAP) decision
rule [13]. The MAP rule decodes an acoustic utterance $y$ as the speaker $\hat{S}$ for which

\[ \hat{S} = \arg\max_{S_i} \, p(S_i \mid y) \tag{1} \]

\[ \phantom{\hat{S}} = \arg\max_{S_i} \, p(y \mid S_i)\, p(S_i) \tag{2} \]

where $p(S_i \mid y)$ is the a posteriori probability measure of the speaker $S_i$ given the utterance,
$p(S_i)$ is the a priori probability measure of the speaker, and $p(y \mid S_i)$ is the conditional
probability measure of the utterance given the speaker. Equation (2) follows from Bayes'
rule, since $p(y)$ does not depend on the speaker.
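
As an illustration of the MAP rule in (1)-(2), the following minimal sketch picks the speaker index that maximizes the log-likelihood plus the log prior. The function name `map_identify` and the example scores are hypothetical and are not taken from the system described in this report; a uniform prior is assumed when none is supplied.

```python
import numpy as np

def map_identify(log_likelihoods, log_priors=None):
    """Return the index i maximizing log p(y|S_i) + log p(S_i).

    log_likelihoods: one log p(y|S_i) per candidate speaker (hypothetical input).
    log_priors: optional log p(S_i); a uniform prior is assumed if omitted.
    """
    ll = np.asarray(log_likelihoods, dtype=float)
    if log_priors is None:
        log_priors = np.full(ll.shape, -np.log(ll.size))
    return int(np.argmax(ll + np.asarray(log_priors, dtype=float)))

# Made-up log-likelihoods for three enrolled speakers.
scores = [-1520.3, -1498.7, -1510.1]
print(map_identify(scores))
```

Working in the log domain avoids numerical underflow, since the likelihood of a long utterance is a product of many small per-frame terms.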

Speaker verification constitutes a binary detection problem, in which case the MAP
rule may be represented as a likelihood ratio test. Thus the decision rule becomes

\[ \frac{p(y \mid S_T)}{p(y \mid S_B)} \; \underset{\text{Background}}{\overset{\text{Target}}{\gtrless}} \; \gamma \tag{3} \]

where $p(y \mid S_T)$ represents the probability measure of the speaker of interest, or target
speaker, $p(y \mid S_B)$ represents the probability measure of the non-target, or background,
speakers, and $\gamma$ represents the decision threshold. There are two types of decision errors in the
verification problem. The first is that of deciding that the utterance is from a target speaker
when it is really from a background speaker. The second is deciding that it is from a background
speaker when it is from a target speaker. These two errors are referred to as false alarms and
misses respectively. For a given test, the probabilities of these two types of errors may
be traded off against each other by varying the decision threshold $\gamma$. A plot of the false
alarm probability versus either the probability of miss or the probability of detection is referred
to as a detection error tradeoff (DET) curve [12] or a receiver operating characteristic (ROC) curve
[13], respectively. Frequently the performance at the point of equal false alarm and miss
probability is quoted to allow easier comparisons between tests. The error at this point is
referred to as the equal error rate (EER).
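
The threshold sweep behind a DET curve and the EER can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the scores are assumed to be log-likelihood ratios, and a trial is assumed to be accepted as "target" when its score is at or above the threshold.

```python
import numpy as np

def error_rates(target_scores, background_scores, thresholds):
    """False alarm and miss probabilities for each candidate threshold."""
    t = np.asarray(target_scores, dtype=float)[None, :]
    b = np.asarray(background_scores, dtype=float)[None, :]
    th = np.asarray(thresholds, dtype=float)[:, None]
    p_miss = np.mean(t < th, axis=1)   # target trials rejected
    p_fa = np.mean(b >= th, axis=1)    # background trials accepted
    return p_fa, p_miss

def equal_error_rate(target_scores, background_scores):
    """Approximate EER: the operating point where p_fa and p_miss are closest."""
    thresholds = np.sort(np.concatenate([target_scores, background_scores]))
    p_fa, p_miss = error_rates(target_scores, background_scores, thresholds)
    i = int(np.argmin(np.abs(p_fa - p_miss)))
    return 0.5 * (p_fa[i] + p_miss[i]), thresholds[i]

# Made-up scores for four target and four background trials.
eer, gamma = equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, 1.5])
print(eer, gamma)
```

With a finite number of trials the two error curves are step functions, so the EER is only located to within the step size; plotting `p_fa` against `p_miss` over all thresholds gives the empirical DET curve.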

The probability measures used in the above tests are not explicitly available. The
tests may be implemented by estimating the probability measures $p(y \mid S)$ from training
data, and using these estimates in the test as if they were the actual probability measures.
This approach is referred to as the plug-in (PI) method. The estimation procedure is
facilitated by prescribing parametric models for the probability measures. This has the
advantage that only the limited parameter sets of the model need to be estimated from
the training data. Parameter set estimation is generally accomplished using the maximum
likelihood (ML) criterion. Other estimation criteria, such as the maximum mutual
information (MMI) approach and the minimum discrimination information (MDI) approach,
have been considered in the literature. These methods were compared in [14]. When, as is
usually the case, the training sequence is considerably longer than the test sequence, ML
parameter estimation combined with the PI approach can be shown to be asymptotically
optimal in the minimum probability of error sense [10, 11].

The most common parametric probability model for speaker recognition is the GMM,
which is an example of the more general HMM used extensively in speech recognition
systems. HMMs are also referred to as probabilistic functions of Markov chains [15, 16]
and as Markov sources [17, 18]. In the next section, we present some of the salient points
of the GMM.


2.1  GMM  models 

The  GMM  consists  of  a  finite  number  of  states  that  are  visited  according  to  a  state 
probability.  When  a  particular  state  is  visited,  a  random  process  is  generated  according 
to  a  probability  measure  that  is  associated  with  the  state.  This  output  random  process  is 
observable,  but  the  actual  state  from  which  the  process  originated  is  not.  Thus  the  state 
is  considered  hidden.  A  state  sequence  is  generated  as  the  process  evolves  with  time.  The 
GMM  is  completely  specified  by  a  parameter  set  consisting  of  the  state  probabilities  and 
the  parameters  of  the  state  probability  measures. 

We now present the standard assumptions of the GMM. For notational convenience,
we suppress the conditioning of the parameter set $\lambda$ of the GMM on the particular speaker,
as all speakers may be treated equally. Let $y = \{y_t, t = 1, \ldots, T\}$, $y_t \in \mathbb{R}^K$, be a sequence
of vectors generated by a GMM. Let $s = \{s_t, t = 1, \ldots, T\}$, $s_t \in \{1, \ldots, M\}$, be the
sequence of states that generated $y$. We can express the model $p(y \mid \lambda)$ as

\[ p(y \mid \lambda) = \sum_{s \in \mathcal{S}} p(y \mid s, \lambda)\, p(s \mid \lambda) \tag{4} \]

where $\mathcal{S}$ is the set of all possible sequences of states, $p(y \mid s, \lambda)$ is the output probability
of $y$ given the state sequence $s$, and $p(s \mid \lambda)$ is the probability of the state sequence $s$.
The observation vectors $\{y_t\}$ are assumed to be independent of each other given the state
sequence $\{s_t\}$, and the states in the sequence are assumed to be independent of each other. Thus

\[ p(y \mid s, \lambda) = \prod_{t=1}^{T} p(y_t \mid s_t, \lambda) \tag{5} \]

and hence

\[ p(y \mid \lambda) = \sum_{s \in \mathcal{S}} \prod_{t=1}^{T} p(s_t \mid \lambda)\, p(y_t \mid s_t, \lambda). \tag{6} \]

The state output probability density function $p(y_t \mid s_t, \lambda)$ is Gaussian, thus

\[ p(y_t \mid s_t, \lambda) = \frac{\exp\left\{ -\tfrac{1}{2} (y_t - \mu_{s_t})^{\mathsf{T}} R_{s_t}^{-1} (y_t - \mu_{s_t}) \right\}}{(2\pi)^{K/2} \det(R_{s_t})^{1/2}} \tag{7} \]

where $R_{s_t}$ and $\mu_{s_t}$ are the state dependent covariance matrix and mean respectively.
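
Because the states are independent, the sum over state sequences in (6) factorizes into a product over frames of a per-frame sum over the $M$ states. The sketch below evaluates the resulting log-likelihood for diagonal covariances. It is a minimal illustration under that assumption; the function names and argument layout are hypothetical and are not taken from the system described in this report.

```python
import numpy as np

def log_gauss_diag(Y, means, variances):
    """log of the Gaussian density in (7) for every frame t and state m.

    Y: (T, K) frames; means, variances: (M, K), with variances holding the
    diagonal of each covariance matrix. Returns an array of shape (T, M).
    """
    Y = np.asarray(Y, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    K = Y.shape[1]
    diff = Y[:, None, :] - means[None, :, :]
    quad = np.sum(diff ** 2 / variances[None, :, :], axis=2)
    log_det = np.sum(np.log(variances), axis=1)
    return -0.5 * (quad + log_det[None, :] + K * np.log(2.0 * np.pi))

def gmm_log_likelihood(Y, weights, means, variances):
    """log p(y|lambda): sum over frames of the log of the state-weighted sum."""
    log_joint = np.log(np.asarray(weights, dtype=float))[None, :] \
        + log_gauss_diag(Y, means, variances)
    mx = log_joint.max(axis=1, keepdims=True)   # log-sum-exp for stability
    per_frame = mx[:, 0] + np.log(np.sum(np.exp(log_joint - mx), axis=1))
    return float(np.sum(per_frame))

# One frame at the mean of a single unit-variance state.
print(gmm_log_likelihood([[0.0]], [1.0], [[0.0]], [[1.0]]))
```

The verification score in (3) can then be formed as the difference between two such log-likelihoods, one under the target model and one under the background model.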

The parameter set $\lambda$ of the GMM must be estimated from training data. A
computationally efficient algorithm based upon the expectation-maximization (EM) algorithm
[19] is available for the iterative ML estimate of the parameter set. This solution was first
derived by Baum et al. [15, 16] for the more general case of the ML estimate of the
parameter set of an HMM. The solution for the GMM is considerably simpler and it appears
in [20, Eqs. (5)-(7)].
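
For orientation, one iteration of the textbook EM update for a diagonal-covariance GMM can be sketched as below. This is the standard form of the update and is offered as an assumption about what the cited re-estimation formulas compute, not as a transcription of [20]; the function name `em_step` and the variance floor are hypothetical.

```python
import numpy as np

def em_step(Y, weights, means, variances, var_floor=1e-6):
    """One EM iteration; returns updated parameters and the log-likelihood
    of the data under the parameters passed in."""
    Y = np.asarray(Y, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    T, K = Y.shape
    diff = Y[:, None, :] - means[None, :, :]
    log_joint = (np.log(weights)[None, :]
                 - 0.5 * (np.sum(diff ** 2 / variances[None, :, :], axis=2)
                          + np.sum(np.log(variances), axis=1)[None, :]
                          + K * np.log(2.0 * np.pi)))
    mx = log_joint.max(axis=1, keepdims=True)
    log_frame = mx + np.log(np.exp(log_joint - mx).sum(axis=1, keepdims=True))
    post = np.exp(log_joint - log_frame)     # E-step: state posteriors, (T, M)
    occ = post.sum(axis=0)                   # expected state occupancies
    new_weights = occ / T                    # M-step
    new_means = (post.T @ Y) / occ[:, None]
    new_vars = (post.T @ Y ** 2) / occ[:, None] - new_means ** 2
    new_vars = np.maximum(new_vars, var_floor)   # guard against collapse
    return new_weights, new_means, new_vars, float(log_frame.sum())

rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
params = (np.array([0.5, 0.5]), np.array([[-1.0, 0.0], [1.0, 0.0]]), np.ones((2, 2)))
w1, mu1, var1, ll0 = em_step(Y, *params)
w2, mu2, var2, ll1 = em_step(Y, w1, mu1, var1)
print(ll0, ll1)
```

Each call returns the log-likelihood of the incoming parameters, so comparing successive return values gives a simple convergence check.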

An alternative approach using the segmental k-means algorithm may also be
used. In this case, the likelihood function is approximated by

\[ \sum_{s \in \mathcal{S}} p(y, s \mid \lambda) \approx \max_{s \in \mathcal{S}} p(y, s \mid \lambda). \tag{8} \]

This approach can be shown to yield similar results under certain conditions [21, 22, 23].
This technique finds a single most likely sequence of states, $s^*$. The same re-estimation
formulas of [20, Eqs. (5)-(7)] may now be used with $p(s_t \mid y_t)$ replaced by

\[ p(s_t \mid y_t) = \delta_{s_t s_t^*} \tag{9} \]

\[ \phantom{p(s_t \mid y_t)} = \begin{cases} 1 & s_t = s_t^* \\ 0 & \text{otherwise.} \end{cases} \tag{10} \]
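
For a GMM the states are independent, so the maximization in (8) can be carried out frame by frame. The following sketch computes the most likely state sequence $s^*$, the approximate log-likelihood, and the hard posteriors of (9)-(10). The function names are hypothetical and diagonal covariances are assumed.

```python
import numpy as np

def best_state_sequence(Y, weights, means, variances):
    """Return s* (one state index per frame) and log max_s p(y, s | lambda)."""
    Y = np.asarray(Y, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    K = Y.shape[1]
    diff = Y[:, None, :] - means[None, :, :]
    log_joint = (np.log(np.asarray(weights, dtype=float))[None, :]
                 - 0.5 * (np.sum(diff ** 2 / variances[None, :, :], axis=2)
                          + np.sum(np.log(variances), axis=1)[None, :]
                          + K * np.log(2.0 * np.pi)))
    s_star = np.argmax(log_joint, axis=1)              # per-frame best state
    approx_ll = float(log_joint.max(axis=1).sum())     # right-hand side of (8)
    return s_star, approx_ll

def hard_posteriors(s_star, M):
    """Indicator posteriors: row t is 1 at state s*_t and 0 elsewhere."""
    return np.eye(M)[np.asarray(s_star)]

s_star, approx_ll = best_state_sequence(
    [[-2.0], [2.0]], [0.5, 0.5], [[-2.0], [2.0]], [[1.0], [1.0]])
print(s_star, hard_posteriors(s_star, 2))
```

The approximate log-likelihood never exceeds the exact value, because the maximum over state sequences is one of the non-negative terms in the sum it replaces.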


The ML parameter estimation criterion is not the only possible criterion. Other
parameter estimation criteria are sometimes used, e.g., maximum mutual information (MMI),
minimum discrimination information (MDI), or minimization of the empirical error rate,
but their implementation is significantly more complicated than the ML approach.

Both  GMMs  and  HMMs  are  widely  accepted  as  reliable  statistical  models  for  speech 
signals  and  they  have  been  successfully  applied  for  speaker  recognition,  speech  recognition, 
and  speech  enhancement  [24].  The  difference  between  the  two  models  is  that  in  the  HMM, 
the  state  transitions  are  Markovian,  whereas  the  state  transitions  of  the  GMM  are  not. 
In  [25]  it  is  reported  that  the  Markovian  state  transitions  of  the  HMM  do  not  provide  any 
extra  performance  for  the  speaker  identification  problem. 

There are two main interpretations of the way in which GMMs and HMMs model speech
signals. In the acoustic modelling interpretation [26], each state corresponds to a given
configuration of the vocal tract. In the spoken language interpretation [27], each state
represents a particular phoneme. HMMs and GMMs are also suitable for modelling noise
sources [28, 29, 11].

Both  GMMs  and  HMMs  are  frequently  used  to  model  feature  vectors  obtained  from  the 
speech  waveform.  The  most  popular  feature  vectors  are  those  obtained  via  the  cepstrum 
[27].  For  speaker  recognition,  the  cepstral  vectors  are  modelled  by  GMMs  with  diagonal 
covariances.  The  next  section  describes  the  cepstrum. 
2.2 Cepstral coefficients

Cepstral techniques have been applied successfully to speech signals since their
introduction in the 1960s [27] and they now constitute the “standard” speech representation
for speech and speaker recognition. Picone [30], in a survey of modern speech recognition
systems, found that 21 out of 26 non-neural-network based speech recognition systems
used some form of cepstral processing.

The cepstrum is an example of a homomorphic [27] signal processing technique and
is defined as the inverse Fourier transform of the log of the power spectral density of the
signal. Thus the cepstral coefficients $c(n)$ are given by

\[ c(n) = \int_{-\pi}^{\pi} \log S(\omega)\, e^{j\omega n}\, \frac{d\omega}{2\pi}. \tag{11} \]

The cepstrum has the ability to deconvolve signals. If a signal $u$ is assumed to have been
obtained from passing an excitation signal $w$ through a linear filter with impulse response
$g$, then it may be described as follows

\[ u = w \otimes g \tag{12} \]

where $\otimes$ denotes convolution. In the frequency domain this is represented as

\[ U(\omega) = W(\omega)\, G(\omega) \tag{13} \]

where $U(\omega)$, $W(\omega)$, and $G(\omega)$ are the Fourier transforms of $u$, $w$, and $g$ respectively. If we
take the logarithm of both sides

\[ \log U(\omega) = \log W(\omega) + \log G(\omega). \tag{14} \]

Hence in the log frequency domain, the components due to the excitation and the filter
response are additive and are potentially easier to separate. It is this idea that provides
a rationale for the cepstral analysis of speech signals. Considering the acoustic model of
speech production, the cepstrum of the speech signal has a component due to the vocal
tract and an additive component due to the excitation produced by the vocal cords. Thus
conventional signal processing techniques may be used to separate these components.
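
The additivity in the cepstral domain can be checked numerically. The sketch below uses a discrete, FFT-based cepstrum and circular convolution so that the relation holds exactly on the FFT grid; the signals, the one-pole filter, and the function name `power_cepstrum` are illustrative assumptions, not data or code from this report.

```python
import numpy as np

def power_cepstrum(x, nfft):
    """Inverse FFT of the log power spectrum (a discrete analogue of the
    cepstrum definition given above)."""
    spec = np.abs(np.fft.fft(x, nfft)) ** 2
    return np.fft.ifft(np.log(spec)).real

rng = np.random.default_rng(0)
nfft = 256
w = rng.standard_normal(nfft)        # stand-in excitation signal
g = 0.9 ** np.arange(nfft)           # impulse response of a one-pole filter
# Circular convolution of w and g, computed in the frequency domain.
u = np.fft.ifft(np.fft.fft(w) * np.fft.fft(g)).real

c_u = power_cepstrum(u, nfft)
c_w = power_cepstrum(w, nfft)
c_g = power_cepstrum(g, nfft)
print(np.allclose(c_u, c_w + c_g))   # cepstrum of u is the sum of the two
```

Because the filter contribution enters as an additive term, it can in principle be isolated or removed by linear operations on the cepstral coefficients.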

In GMM-based recognition, the cepstral vectors are assumed to be Gaussian with
non-zero means and diagonal covariances. The justification for the diagonal covariance
assumption is provided in [31], where it is shown that, under certain regularity conditions
on $S(\omega)$, for every two fixed positive integers $k$ and $l$

\[ \lim_{L \to \infty} \lim_{K \to \infty} K \, \mathrm{cov}(c(k), c(l)) = \delta_{kl} \tag{15} \]

where $\delta_{kl} = 1$ if $k = l$ and zero otherwise, and $c(k)$ and $c(l)$ are the $k$th and $l$th
empirical cepstral coefficients obtained from a length $L$ smoothed periodogram estimate of
$S(\omega)$. Thus the empirical cepstral coefficients, calculated by a smoothed periodogram, are
asymptotically uncorrelated with a variance of $1/K$.

Generally, approximately 20 of these so-called “first-order” cepstral coefficients are
kept for speaker recognition purposes. Often the cepstral feature vector is appended with
delta-cepstral coefficients, which are approximations to the time derivatives of the cepstral
coefficients, and possibly delta-delta cepstra, which are approximations to the second time
derivatives of the cepstral coefficients [32].

Estimation of the cepstral coefficients from data requires an estimate of the spectrum
$S(\omega)$. This estimate may be obtained using non-parametric spectral estimation (e.g.
periodogram or Blackman-Tukey) or by parametric estimation, usually based on autoregressive
(AR) modelling. In the latter case, a recursion is available to obtain the cepstral coefficients
directly from the AR coefficients without explicitly calculating the spectrum [26, p 230].
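
A commonly quoted form of such a recursion is sketched below. It assumes the all-pole model $H(z) = 1 / (1 - \sum_{k=1}^{p} a_k z^{-k})$ with unit gain; the sign convention and the function name are assumptions and may differ from the presentation in [26]. Under this convention, $c(n) = a_n + \sum_{k=1}^{n-1} (k/n)\, c(k)\, a_{n-k}$, with $a_n = 0$ for $n > p$.

```python
import numpy as np

def ar_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c(1)..c(n_ceps) from AR coefficients.

    a[k-1] holds a_k for the assumed model H(z) = 1 / (1 - sum_k a_k z^-k).
    """
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(n_ceps + 1)          # c[0] is unused (it depends on the gain)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        # Only terms with 1 <= n - k <= p contribute, since a_j = 0 for j > p.
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

# First-order check: for a single pole at 0.5, c(n) = 0.5**n / n.
print(ar_to_cepstrum([0.5], 5))
```

The first-order case can be verified by hand from the series expansion of $-\log(1 - a z^{-1})$, which is what the printed check compares against.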


2.3  Robustness  of  Speaker  Recognition 

In  general,  speech  and  speaker  recognition  systems  work  very  well  when  trained  and 
tested  under  similar  conditions.  However  the  systems  are  often  not  robust  and  very  large 
degradations  in  performance  occur  when  there  is  mismatch  between  training  and  test 
conditions  [33].  Indeed,  this  lack  of  robustness  is  considered  the  major  stumbling  block  for 
the  widespread  introduction  of  both  speaker  and  speech  recognition  systems.  There  are 
two  distinct  causes  for  mismatch  between  training  and  testing  conditions:  additive  noise 
and  channel  distortions.  We  do  not  consider  the  effects  of  additive  noise  in  this  report  as 
there  are  many  tutorials  on  this  subject  (see,  e.g.,  [34,  35]  and  references  therein).  In  this 
report  we  concentrate  on  channel  effects  and  ways  to  combat  them. 

Channel distortions are induced by variations in the transmission path from the speaker
to the recognition system input. A channel has numerous components, including the
acoustics of the room in which the utterance is made, the recording microphone, and the path
from the microphone to the recognition system, which in some applications is via the
inherently variable public telephone network. The effects on a signal due to channel variations
are referred to as convolutional noise. Mismatch occurs when the convolutional noise is
different in testing than it was during training. Mismatch due to differences in recording
microphones can have a particularly large impact on performance. In [33] it was found that
very large degradations in performance occurred when a system was trained and tested
using different microphones. Room acoustics may also be a factor, as training of the
system is often performed using data obtained in anechoic chambers. Unless testing is also
performed in such environments, mismatch will occur. There are two main techniques used for
channel compensation. These are cepstral mean subtraction (CMS) and RASTA filtering.

Cepstral Mean Subtraction. CMS is a straightforward and easy to implement
technique for combating channel effects. The simplest implementation of CMS involves
estimating the mean of each cepstral coefficient over the entire utterance. The
resulting cepstral mean vector is then subtracted from each cepstral vector [36, 37]. The
removal of the mean in the cepstral domain corresponds to the removal of the long
term spectral characteristics of the time domain signal. The assumption of CMS is
that these long term spectral characteristics are due to channel effects. Techniques
where only a short-term cepstral mean estimate is calculated and subtracted have
also been proposed (see, e.g., [38]).

RASTA. The relative spectral (RASTA) [39] technique is an example of a filtering
technique designed to suppress constant or slowly varying signal characteristics. There
are many other examples of similar filtering techniques (see, e.g., [35] and the
references therein). In [39] the details of a particular filter are given that produced
substantial improvements in the case of speech corrupted by convolutional noise.
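
The simplest form of CMS described above amounts to removing the per-coefficient mean over the utterance. In the minimal sketch below, the function name and the synthetic data are hypothetical; the fixed channel is modelled as a constant offset added to every cepstral vector, which is the assumption CMS relies on.

```python
import numpy as np

def cepstral_mean_subtraction(C):
    """Subtract the utterance-level mean of each cepstral coefficient.

    C: (n_frames, n_ceps) array of cepstral vectors for one utterance.
    """
    C = np.asarray(C, dtype=float)
    return C - C.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
clean = rng.standard_normal((50, 20))    # stand-in cepstral vectors
channel = rng.standard_normal(20)        # fixed channel offset in the cepstral domain
observed = clean + channel               # the same offset on every frame

# The channel offset is removed, so both versions give the same features.
print(np.allclose(cepstral_mean_subtraction(observed),
                  cepstral_mean_subtraction(clean)))
```

Note that the speaker's own long-term average is removed along with the channel offset, which is the price paid for the compensation.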

In  the  next  section,  we  describe  an  implementation  of  a  speaker  recognition  system 
based  on  the  theory  described  in  this  section. 


3  Implementation 

In  this  section  we  describe  the  implementation  of  a  speaker  recognition  system  based 
on  Gaussian  mixture  modelling  of  cepstral  coefficients.  We  discuss  the  calculation  of  the 
cepstral  coefficients,  important  modelling  details,  and  the  speech  corpora  used  to  test  the 
system. 


3.1  Preprocessing 

Cepstral coefficients for a given speech vector are estimated from a spectral estimate
obtained using the window method [40]. Specifically, the speech utterance is divided into
frames of K = 100 samples each and a smoothed periodogram estimate of the power
spectral density is obtained for each frame. For each frame, a 5K length “super-frame” is
formed by concatenating the two K length frames on either side of the frame to the original
frame. The autocorrelation sequence is obtained by the inverse fast Fourier transform
(FFT) of the magnitude squared of the FFT of the super-frame. The autocorrelation
sequence is windowed by a Hanning window of length K/3. The FFT of the windowed
autocorrelation sequence forms an estimate of the spectrum. See [40] for an analysis of
this procedure for calculating an estimate of the power spectral density. The cepstral
coefficients are obtained from the real part of the inverse FFT of the log spectrum. The
20 cepstral coefficients saved for each frame form the feature vector that is modelled as
a GMM with diagonal covariances.
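
The processing chain above can be sketched as follows. Several details are not specified in the text and are assumptions in this sketch: the FFT is zero-padded to twice the super-frame length, the Hanning lag window is taken to be symmetric about lag zero and to reach zero at lag K/3 (rounded down), the spectral estimate is floored before taking the logarithm, frames without two neighbours on each side are skipped, and coefficients 1 to 20 are kept. The function name `frame_cepstra` is hypothetical.

```python
import numpy as np

def frame_cepstra(x, K=100, n_ceps=20):
    """Cepstral feature vectors from a smoothed periodogram, one per frame."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // K
    sf_len = 5 * K                      # super-frame: two frames either side
    nfft = 2 * sf_len                   # zero-pad to avoid circular wrap-around
    lag_len = K // 3                    # assumed one-sided extent of the lag window
    hann = np.hanning(2 * lag_len + 1)  # symmetric window, equal to 1 at lag 0
    half = hann[lag_len:]               # lags 0 .. lag_len
    lag_win = np.zeros(nfft)
    lag_win[:lag_len + 1] = half        # non-negative lags
    lag_win[-lag_len:] = half[:0:-1]    # negative lags, stored at the end
    feats = []
    for i in range(2, n_frames - 2):
        sf = x[(i - 2) * K:(i + 3) * K]
        periodogram = np.abs(np.fft.fft(sf, nfft)) ** 2
        acf = np.fft.ifft(periodogram).real / sf_len   # autocorrelation estimate
        psd = np.fft.fft(acf * lag_win).real           # smoothed spectrum
        psd = np.maximum(psd, 1e-12)                   # floor before the log
        ceps = np.fft.ifft(np.log(psd)).real
        feats.append(ceps[1:n_ceps + 1])
    return np.array(feats).reshape(-1, n_ceps)

x = np.random.default_rng(1).standard_normal(1000)   # stand-in for speech samples
C = frame_cepstra(x)
print(C.shape)
```

The floor is needed because a Hanning lag window does not guarantee a non-negative spectral estimate; the value used here is arbitrary.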


3.2  Modelling 

Both the background speakers and the individual target speakers are modelled by a
GMM with M = 20 states. This figure is substantially smaller than the 2048 states
used in the MIT LL system [12] and represents a considerable computational saving.
The Gaussian mixtures are non-zero mean with diagonal covariance matrices. Training
of the GMM is accomplished using the segmental k-means algorithm (see Section 2.1).
The training algorithm is initialized with models obtained by a random clustering of
the training data. In [25, 20] this simple initialization procedure demonstrated similar
performance to more elaborate procedures based on phonetic clustering. The EM training
iterations are terminated when the difference in likelihood between successive iterations
is less than 1%.
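
A training loop of this kind can be sketched as below. It is an illustrative sketch under stated assumptions rather than the implementation used here: the random clustering is taken to be a uniform random assignment of frames to states, the 1% stopping rule is applied to the relative change in the approximate log-likelihood, empty states are reseeded from a random frame, and a variance floor is applied. The function name and its defaults, other than M = 20 and the 1% tolerance, are hypothetical.

```python
import numpy as np

def train_gmm_hard(Y, M=20, tol=0.01, max_iter=100, var_floor=1e-3, seed=0):
    """Hard-assignment (segmental k-means style) training of a diagonal GMM."""
    Y = np.asarray(Y, dtype=float)
    T, K = Y.shape
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, M, size=T)     # random clustering of the frames
    prev_ll = None
    for _ in range(max_iter):
        weights = np.empty(M)
        means = np.empty((M, K))
        variances = np.empty((M, K))
        for m in range(M):
            Ym = Y[labels == m]
            if len(Ym) == 0:                # empty state: reseed from one frame
                Ym = Y[rng.integers(0, T, size=1)]
            weights[m] = len(Ym) / T
            means[m] = Ym.mean(axis=0)
            variances[m] = np.maximum(Ym.var(axis=0), var_floor)
        weights /= weights.sum()
        diff = Y[:, None, :] - means[None, :, :]
        log_joint = (np.log(weights)[None, :]
                     - 0.5 * (np.sum(diff ** 2 / variances[None, :, :], axis=2)
                              + np.sum(np.log(variances), axis=1)[None, :]
                              + K * np.log(2.0 * np.pi)))
        labels = np.argmax(log_joint, axis=1)        # most likely state per frame
        ll = float(log_joint.max(axis=1).sum())      # approximate log-likelihood
        if prev_ll is not None and abs(ll - prev_ll) < tol * abs(prev_ll):
            break
        prev_ll = ll
    return weights, means, variances, ll

rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(-5, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
w, mu, var, ll = train_gmm_hard(Y, M=2)
print(w, ll)
```

A relative tolerance on the log-likelihood is only one reading of the 1% criterion; a tolerance on the likelihood itself would behave differently for long utterances.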
3.3 Speech Corpus

The  systems  described  above  have  been  implemented  and  evaluated  using  the  data 
and  tests  of  the  1996  NIST  Speaker  Recognition  Evaluation  Workshop.  This  workshop 
addresses  the  speaker  verification  problem  and  consists  of  a  speech  corpus  and  a  series 
of  tests  designed  to  evaluate  the  effects  of  utterance  length,  speaker  sex,  and  microphone 
type  on  speaker  recognition  performance.  More  details  of  the  evaluation  may  be  found  in 
[41,  12]. 

For the purposes of this study we considered a subset of the full evaluation, consisting
of male speakers only. The models were trained on 2 minutes of speech from 2
different microphones. The testing utterance length was nominally 30 seconds. Results
for this test were split according to the microphone used during testing. The so-called
“matched” results are those where the testing microphone was also used during training, and
the “mis-matched” results are those where different microphones were used during
testing and training. The background model was trained using speakers from the 1996
development corpus. There were no target speakers present in the background data.


3.4  Results  and  discussion 

Fig. 1 shows the performance under matched and mis-matched microphone conditions.

[Figure 1: Performance of GMM system for matched and mis-matched microphones.
Plot title: DET curve, 96 NIST Evaluation, Male 2 handset.]

The equal error rates (EERs) in Fig. 1 are approximately 5% and 14% under matched
and mis-matched conditions respectively. The MIT LL “UBM-independent” system is
similar to that developed here in that it used independently trained target models. When
tested on the same subset of the NIST evaluation data as the system developed here,
the UBM-independent system had matched and mis-matched EERs of 8% and 20%
respectively. The performance of the MIT LL system was significantly improved by using a
speaker adaptation technique, referred to as “UBM-adapt” in [12]. In this case the EERs
were approximately 3% and 12%. The adaptation technique has the disadvantage that
the so-called “relevance factor” [12, Eq. (7)] must generally be determined
experimentally. Large increases in mis-matched performance were obtained in [12] via microphone
adaptation. However, this adaptation is difficult to implement in some applications.


4  Conclusion 

The speaker recognition technology presented in this report has a level of performance
sufficient for it to be used in defence and civilian applications. The signal processing
techniques used are well known, easy to implement, and have demonstrated reliable
performance by way of their use in commercial dictation systems. The system described in
this report provides performance comparable, and in some cases superior, to that of other
systems in the literature. Further work in this area should address the relatively low
performance under mis-matched microphone conditions.